text extraction
Title block detection and information extraction for enhanced building drawings search
Lombardi, Alessio, Duan, Li, Elnagar, Ahmed, Zaalouk, Ahmed, Ismail, Khalid, Vakaj, Edlira
The architecture, engineering, and construction (AEC) industry still heavily relies on information stored in drawings for building construction, maintenance, compliance and error checks. However, information extraction (IE) from building drawings is often time-consuming and costly, especially when dealing with historical buildings. Drawing search can be simplified by leveraging the information stored in the title block portion of the drawing, which can be seen as drawing metadata. However, title block IE can be complex, especially when dealing with historical drawings which do not follow existing standards for uniformity. This work compares existing methods for this kind of IE task, and then proposes a novel title block detection and IE pipeline which outperforms existing methods, in particular when dealing with complex, noisy historical drawings. By combining a lightweight Convolutional Neural Network with GPT-4o, the proposed inference pipeline detects building engineering title blocks with high accuracy, and then extracts structured drawing metadata from the title blocks, which can be used for drawing search, filtering, and grouping. The work demonstrates high accuracy and efficiency in IE for both vector (CAD) and hand-drawn (historical) drawings. A user interface (UI) that leverages the extracted metadata for drawing search is established and deployed on real projects, which demonstrates significant time savings. Additionally, an extensible domain-expert-annotated dataset for title block detection is developed, via an efficient AEC-friendly annotation workflow that lays the foundation for future work.
- Europe > United Kingdom > England > West Midlands > Birmingham (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Portugal > Porto > Porto (0.04)
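The two-stage pipeline described in the abstract above (CNN-based title block detection followed by GPT-4o metadata extraction) can be sketched as follows. All function names are hypothetical and both model calls are stubbed, since the detector weights and the actual GPT-4o prompt are not given in the abstract; only the overall detect-then-extract shape is taken from the paper.

```python
# Sketch of a two-stage title-block pipeline: detect the title block
# region, then extract structured metadata from it as JSON.
import json

def detect_title_block(drawing):
    """Stand-in for the lightweight CNN detector: returns a bounding
    box (x, y, w, h) for the title-block region of the drawing."""
    return {"bbox": (1800, 1200, 400, 300)}

def extract_metadata(title_block_region):
    """Stand-in for the GPT-4o call: in practice the cropped title
    block image is sent with a prompt asking for structured JSON."""
    response = ('{"drawing_no": "A-101", '
                '"title": "Ground Floor Plan", "scale": "1:100"}')
    return json.loads(response)

def pipeline(drawing):
    region = detect_title_block(drawing)
    # The real system would crop the drawing to region["bbox"] here.
    metadata = extract_metadata(region)
    metadata["bbox"] = region["bbox"]
    return metadata

result = pipeline("example_drawing.png")
```

The structured dictionary returned here is the kind of drawing metadata the paper indexes for search, filtering, and grouping.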
Making History Readable
Banerjee, Bipasha, Goyne, Jennifer, Ingram, William A.
The Virginia Tech University Libraries (VTUL) Digital Library Platform (DLP) hosts digital collections that offer our users access to a wide variety of documents of historical and cultural importance. These collections are not only of academic importance but also provide our users with a glance at local historical events. Our DLP contains collections comprising digital objects featuring complex layouts, faded imagery, and hard-to-read handwritten text, which makes providing online access to these materials challenging. To address these issues, we integrate AI into our DLP workflow and convert the text in the digital objects into a machine-readable format. To enhance the user experience with our historical collections, we use custom AI agents for handwriting recognition, text extraction, and large language models (LLMs) for summarization. This poster highlights three collections focusing on handwritten letters, newspapers, and digitized topographic maps. We discuss the challenges with each collection and detail our approaches to address them. Our proposed methods aim to enhance the user experience by making the contents in these collections easier to search and navigate.
- North America > United States > Virginia > Montgomery County > Blacksburg (0.05)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
Enhancing Steganographic Text Extraction: Evaluating the Impact of NLP Models on Accuracy and Semantic Coherence
Li, Mingyang, Yuan, Maoqin, Li, Luyao, Pengsihua, Han
This study discusses a new method combining image steganography technology with Natural Language Processing (NLP) large models, aimed at improving the accuracy and robustness of extracting steganographic text. Traditional Least Significant Bit (LSB) steganography techniques face challenges in accuracy and robustness of information extraction when dealing with complex character encoding, such as Chinese characters. To address this issue, this study proposes an innovative LSB-NLP hybrid framework. This framework integrates the advanced capabilities of NLP large models, such as error detection, correction, and semantic consistency analysis, as well as information reconstruction techniques, thereby significantly enhancing the robustness of steganographic text extraction. Experimental results show that the LSB-NLP hybrid framework excels in improving the extraction accuracy of steganographic text, especially in handling Chinese characters. The findings of this study not only confirm the effectiveness of combining image steganography technology and NLP large models but also propose new ideas for research and application in the field of information hiding. The successful implementation of this interdisciplinary approach demonstrates the great potential of integrating image steganography technology with natural language processing technology in solving complex information processing problems.
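The LSB embedding and extraction the abstract builds on can be shown in a few lines. This is a toy round trip in pure Python: a bytearray stands in for image pixel data (real systems write into the colour channels of an image), and the paper's LLM-based error detection and correction stage is omitted.

```python
# Toy Least Significant Bit (LSB) steganography round trip.

def embed(pixels, message):
    payload = message.encode("utf-8")
    bits = "".join(f"{byte:08b}" for byte in payload)
    stego = bytearray(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | int(bit)  # overwrite the LSB only
    return stego

def extract(pixels, n_bytes):
    bits = "".join(str(p & 1) for p in pixels[:n_bytes * 8])
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

cover = bytearray(range(256)) * 4       # 1024 "pixels"
stego = embed(cover, "hi 你好")          # 9 UTF-8 bytes -> 72 bits
recovered = extract(stego, 9)
```

The multi-byte UTF-8 encoding of the Chinese characters is exactly why a single flipped bit can corrupt a whole character, which is the failure mode the paper's NLP correction stage targets.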
Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot Document-Level Question Answering
McDonald, Tavish, Tsan, Brian, Saini, Amar, Ordonez, Juanita, Gutierrez, Luis, Nguyen, Phan, Mason, Blake, Ng, Brenda
Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question and answer). However, data curation for document QA is uniquely challenging because the context (i.e. answer evidence passage) needs to be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from extracted texts to form well-posed contexts; (3) QA to extract knowledge from contexts to return high-quality answers -- extractive, abstractive, or Boolean. Using QASPER for evaluation, our detect-retrieve-comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Merced County > Merced (0.04)
- North America > Dominican Republic (0.04)
- Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)
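The retrieval stage (stage 2) of the three-stage pipeline above can be illustrated with a toy scorer: rank extracted passages by token overlap with the question. This bag-of-words heuristic is only a stand-in for the learned retriever the DRC system actually uses; the passages and question here are invented.

```python
# Toy evidence retrieval: pick the passage sharing the most tokens
# with the question.
import re

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, passages, k=1):
    q = tokens(question)
    return sorted(passages, key=lambda p: len(q & tokens(p)),
                  reverse=True)[:k]

passages = [
    "The model is trained on QASPER with a cross-entropy objective.",
    "We report Answer-F1 on the QASPER test split.",
    "Related work covers open-domain question answering.",
]
best = retrieve("Which metric is reported on the test split?", passages)
```

The selected passage would then be handed to the comprehension stage (stage 3) as the context for extractive, abstractive, or Boolean answering.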
4 must-try new features in Windows 11's huge 2023 Update
Windows 11's 2023 Update is here, bringing with it a number of new features to explore. But which ones are worth trying? We've listed our favorites below. Windows 11's 2023 Update is (eventually) being pushed to your PC as a free, cumulative update, which means that it encompasses features and applications that may have already arrived on your PC. Windows 11 users will receive most of the 2023 Update features by Nov. 14, though it may take longer for some systems.
Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English
Vasantharajan, Charangan, Tharmalingam, Laksika, Thayasivam, Uthayasanker
Most low-resource languages do not have the necessary resources to create even a substantial monolingual corpus. These languages may often be found in government proceedings, but mainly in Portable Document Format (PDF) files that contain legacy fonts. Extracting text from these documents to create a monolingual corpus is challenging due to legacy font usage and printer-friendly encoding, which are not optimized for text extraction. Therefore, we propose a simple, automatic, and novel idea that can scale across Tamil, Sinhala, and English, and across many documents, along with parallel corpora. Since Tamil and Sinhala are low-resource languages, we improved the performance of Tesseract by employing LSTM-based training on more than 20 legacy fonts to recognize printed characters in these languages. In particular, our model detects code-mixed text, numbers, and special characters in the printed documents. We show that this approach reduces the character-level error rate of Tesseract from 6.03 to 2.61 for Tamil (a reduction of 3.42 percentage points) and from 7.61 to 4.74 for Sinhala (2.87 points), as well as the word-level error rate from 39.68 to 20.61 for Tamil (19.07 points) and from 35.04 to 26.58 for Sinhala (8.46 points) on the test set. Also, our newly created parallel corpus consists of 185.4k, 168.9k, and 181.04k sentences and 2.11M, 2.22M, and 2.33M words in Tamil, Sinhala, and English respectively. This study shows that fine-tuning Tesseract models on multiple new fonts helps them recognize the texts and enhances OCR performance. We have made the newly trained models and the source code for fine-tuning Tesseract freely available.
- Asia > Sri Lanka > Western Province > Colombo > Colombo (0.05)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.56)
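The character error rates quoted in the abstract above are edit (Levenshtein) distance divided by reference length. A minimal implementation for checking OCR output against ground truth, independent of any OCR library:

```python
# Character error rate (CER) = Levenshtein distance / reference length.

def levenshtein(ref, hyp):
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    return levenshtein(reference, hypothesis) / len(reference)
```

The word-level error rate the paper also reports is the same formula applied to lists of words rather than characters.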
Is Data a Differentiator for Your Business? If So, Traditional OCR Cannot Be An Answer - insideBIGDATA
If your business is driven by data, Optical Character Recognition (OCR) -- as most of us know it -- is not the answer. For those of you who view OCR as an industry staple for document processing, let me explain. OCR as a technology has been around for ages and it still has its place in processing unstructured document formats like PDFs, images, and other text formats that cannot be edited digitally. Users can quickly convert those files into editable documents. In short, it's a terrific technology for enabling you to edit and search for files that may have been "frozen."
Text Extraction in Python with Neural Networks
Image capture takes a snapshot in time of a person, place, or object. Cameras are built into many everyday devices, and taking pictures has become part of daily life. When a picture is taken, the device often recognizes the scene and applies an automatic correction. Optical Character Recognition (OCR) takes this further: it can take a picture of text and produce a usable file that matches the original document.
AI-Powered OCR -- Laying Groundwork for Automation? - DZone AI
We arguably live in an era of remarkable technological disruption. The world is becoming more digitalized and businesses are going digital with it; the recent pandemic in particular made us realize the importance of digitization and global connectivity. As a result, countless physical documents have been digitized using advanced technologies. One of these is Optical Character Recognition (OCR).
Negative Statements Considered Useful
Arnaout, Hiba, Razniewski, Simon, Weikum, Gerhard
Knowledge bases (KBs), pragmatic collections of knowledge about notable entities, are an important asset in applications such as search, question answering and dialogue. Rooted in a long tradition in knowledge representation, all popular KBs only store positive information, while they abstain from taking any stance towards statements not contained in them. In this paper, we make the case for explicitly stating interesting statements which are not true. Negative statements would be important to overcome current limitations of question answering, yet due to their potential abundance, any effort towards compiling them needs a tight coupling with ranking. We introduce two approaches towards compiling negative statements. (i) In peer-based statistical inferences, we compare entities with highly related entities in order to derive potential negative statements, which we then rank using supervised and unsupervised features. (ii) In query-log-based text extraction, we use a pattern-based approach for harvesting search engine query logs. Experimental results show that both approaches hold promising and complementary potential. Along with this paper, we publish the first datasets on interesting negative information, containing over 1.1M statements for 100K popular Wikidata entities.
- North America > United States (0.14)
- South America > Chile (0.14)
- Oceania > New Zealand (0.04)
- (6 more...)
- Research Report (1.00)
- Personal > Honors (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine (1.00)
- Media > Film (0.68)
- Government > Regional Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.68)
- Information Technology > Communications > Web > Semantic Web (0.68)
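The peer-based statistical inference described in the abstract above can be sketched with toy data: properties that most highly related peer entities have, but the target entity lacks, become candidate negative statements, ranked here simply by peer frequency. The entities and properties are invented, and the paper's supervised and unsupervised ranking features are reduced to this single frequency score.

```python
# Toy peer-based inference of candidate negative statements.
from collections import Counter

def candidate_negatives(entity_props, peer_prop_sets):
    counts = Counter(p for peers in peer_prop_sets for p in peers)
    candidates = [(p, n / len(peer_prop_sets))
                  for p, n in counts.items() if p not in entity_props]
    return sorted(candidates, key=lambda x: -x[1])

peers = [
    {"award:Nobel Prize", "occupation:physicist"},
    {"award:Nobel Prize", "occupation:chemist"},
    {"occupation:physicist"},
]
entity = {"occupation:physicist"}
ranked = candidate_negatives(entity, peers)
```

Here two of three peers hold a Nobel Prize while the target entity does not, so "did not win a Nobel Prize" surfaces as the top-ranked candidate negative statement.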